<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="content-type" content="text/html; charset=utf-8" /><title /><meta name="generator" content="StarOffice/OpenOffice.org XSLT (http://xml.openoffice.org/sx2ml)" /><meta name="created" content="2008-10-15T21:43:23" /><meta name="changedby" content="Nicholas Fullagar" /><meta name="changed" content="2009-05-29T11:34:20" /><base href="." /><style type="text/css">
	@page { size: 8.5inch 11inch; margin-top: 0.7874inch; margin-bottom: 0.7874inch; margin-left: 0.7874inch; margin-right: 0.7874inch }
	table { border-collapse:collapse; border-spacing:0; empty-cells:show }
	td, th { vertical-align:top; }
	h1, h2, h3, h4, h5, h6 { clear:both }
	ol, ul { padding:0; }
	* { margin:0; }
	*.Caption { font-family:'Nimbus Roman No9 L'; font-size:12pt; margin-top:0.0835in; margin-bottom:0.0835in; font-style:italic; }
	*.Heading { font-family:'Nimbus Sans L'; font-size:14pt; margin-top:0.1665in; margin-bottom:0.0835in; }
	*.Index { font-family:'Nimbus Roman No9 L'; font-size:12pt; }
	*.List { font-family:'Nimbus Roman No9 L'; font-size:12pt; margin-top:0in; margin-bottom:0.0835in; }
	*.P1 { font-family:'Courier 10 Pitch'; font-size:10pt; font-weight:normal; }
	*.P10 { font-family:Courier; font-size:8pt; font-weight:normal; }
	*.P11 { font-family:'Nimbus Roman No9 L'; font-size:12pt; font-weight:normal; }
	*.P12 { font-family:'Courier 10 Pitch'; font-size:12pt; font-weight:normal; }
	*.P13 { font-family:'Courier 10 Pitch'; font-size:12pt; font-weight:bold; }
	*.P2 { font-family:'Courier 10 Pitch'; font-size:16pt; font-weight:bold; }
	*.P3 { font-family:'Courier 10 Pitch'; font-size:16pt; font-weight:bold; }
	*.P4 { font-family:'Nimbus Roman No9 L'; font-size:10pt; font-style:italic; }
	*.P5 { font-family:'Courier 10 Pitch'; font-size:10pt; }
	*.P6 { font-family:'Courier 10 Pitch'; font-size:12pt; }
	*.P7 { font-family:'Nimbus Roman No9 L'; font-size:12pt; font-style:italic; }
	*.P8 { font-family:'Nimbus Roman No9 L'; font-size:12pt; font-weight:bold; }
	*.P9 { font-family:'Nimbus Roman No9 L'; font-size:12pt; font-weight:normal; }
	*.Standard { font-family:'Nimbus Roman No9 L'; font-size:12pt; }
	*.Textbody { font-family:'Nimbus Roman No9 L'; font-size:12pt; margin-top:0in; margin-bottom:0.0835in; }
	*.T1 { font-weight:bold; }
	*.T2 { font-family:'Courier 10 Pitch'; font-weight:normal; }
	*.T3 { font-family:'Courier 10 Pitch'; font-weight:bold; }
	*.T4 { font-family:'Courier 10 Pitch'; }
	*.T5 { font-style:italic; }
	*.T6 { font-style:italic; font-weight:bold; }
	*.T7 { font-weight:normal; }
	</style></head><body dir="ltr"><p class="P1">Copyright 2008, Google Inc. </p><p class="P1">All rights reserved. </p><p class="P1"> </p><p class="P1">Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: </p><p class="P1"> </p><p class="P1">    * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.</p><p class="P1">    * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.</p><p class="P1">    * Neither the name of Google Inc. nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.</p><p class="P1"> </p><p class="P1">THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. </p><p class="P1"> </p><p class="P2"> </p><p class="P3"> </p><p class="P2">XRay: A Hierarchical Profiler for Native Client </p><p class="Standard"> </p><p class="Standard"><span class="T1">Objective:</span> To provide Native Client applications the ability to quickly generate hierarchical profiles, with cycle timing information and optional user supplied annotations.  XRay requires minimal code changes and once added to a project, should easily facilitate a rapid edit-compile-run-profile cycle.  At runtime, generating capture reports using XRay should require less than one second of cpu time -- even for large multi-frame, multi-megabyte captures.</p><p class="Standard"> </p><p class="P4">Note: The .xhtml version of this document is derived from the OpenOffice .odt file and provided as a convenience.  Edits should be performed on the .odt file and re-exported to the .xhtml version.</p><p class="P5"> </p><p class="Standard"><span class="T1">Introduction: </span> Understanding the performance of large bodies of code is a difficult task without the use of a profiler.  XRay is a profiling library designed to be user customizable via extending either the source code directly, or running user scripts on the output report.  It relies on the gcc compiler's option <span class="T2">-finstrument-functions</span>, which implicitly adds calls to an entry and exit function for every compilied function.</p><p class="Standard"> </p><p class="Standard"><span class="T1">Setup: </span> A small amount of work is required to integrate XRay into an application.  It is recommended to setup a seperate build target for profiling.  (See the <span class="T2">xray</span> build target in  native_client/common/Makefile.mk)  This target should enable appropriate compiler optimizations, profiling options, preprocessor definitions, include the XRay runtime code, and linker mapfile generation.</p><p class="Standard"> </p><p class="Standard"><span class="T1">Preprocessor defines:</span>  Two preprocessor defines are needed for XRay.  One is to enable the runtime profiler itself, which will consume extra memory and impact performance slightly.  The other option controls whether or not user supplied annotations should be compiled into code.</p><p class="Standard"> </p><p class="Standard"><span class="T1">Compiler optimizations:</span>  A profiled build should enable optimizations to make it perform as close to release code as possible.  For gcc and g++, a reasonable starting point is <span class="T2">-O2</span><span class="T1">  </span>Floating point heavy applications may want to also enable use of SSE instructions by specifying <span class="T1"> </span><span class="T2">-mfpmath=sse -msse</span> </p><p class="Standard"> </p><p class="Standard"><span class="T1">Profiling options:</span>  An XRay profiled build should enable the gcc code generation option <span class="T2">-finstrument-functions</span><span class="T3"> </span>and <span class="T2">-falign-functions=32</span>.  These options add the required implicit calls to <span class="T4">__cyg_profile_func_enter</span> and <span class="T4">__cyg_profile_func_exit</span> (provided for you by the XRay library)  XRay requires all functions to be aligned to at least 16 bytes (32 bytes are recommended) in order to effectively pack the function address into the tracebuffer during capture.  For Native Client, this is a non-issue, since functions are aligned by at least this much anyway.</p><p class="Standard"> </p><p class="Standard"><span class="T1">The XRay Runtime Library:</span>  Include the files <span class="T4">xray.c stringpool.c hashtable.c &amp; symtable.c</span>,  in your build.  These files are small and, unless space is an extreme requirement, can be included in all builds.</p><p class="Standard"> </p><p class="Standard"><span class="T1">Enable Linker Options and Post Build Steps: </span> XRay will require a mapfile to match function addresses to function names.  The linker option <span class="T4">-Map myapp.map</span> will output a name mangled mapfile.  Most people find mangled names difficult to read, so a post-build step to de-mangle names via the GNU's <span class="T4">c++filt</span> utility is recommended.  It is also recommended to output static names to the mapfile, or else the trace output will be unable to list names for static functions.  Do this by #defining STATIC to nothing for profile builds, but to be static for other builds, and replacing all use of static in your code with <span class="T4">STATIC</span>.</p><p class="Standard"> </p><p class="P6">  #ifdef (XRAY)</p><p class="P6">  #define STATIC</p><p class="P6">  #else</p><p class="P6">  #define STATIC static</p><p class="P6">  #endif</p><p class="P6"> </p><p class="Standard"><span class="T1">Runtime: </span> A small amount of runtime code will need to be added to the user application.  a)  Initialization of XRay, which will load the mapfile, initialize the symbol table, and initialize the tracebuffer.  b)  The area of code to profile will need to make calls to the XRay library telling it to start and stop profile captures into the tracebuffer.  A profile can consist of one or more 'frames.'  c)  When capturing has completed, a text based report file is written out.  This report is generated with an explicit call to <span class="T4">XRaySaveReport()</span>, so it can be driven from an external event (keypress, unusually long frame, after a certain amount of elapsed time, etc.)  And finally d) shutdown XRay to release resources.</p><p class="Standard"> </p><p class="P7">Important:  When used with Native Client, make sure to launch your profiled application using <span class="T4">sel_ldr</span>'s <span class="T4">-d</span> command line argument, which will enable debug mode and allow file I/O.  Otherwise, XRay won't be able to read the symbol table and/or save the trace report.</p><p class="Standard"> </p><p class="Standard"><span class="T1">Overhead:</span>  An empty function with XRay profiling will require about 100 cycles to add profiling information to the trace buffer.  With optimizations enabled, a profile build will have average performance somewhere between a debug and a release build.  Optional annotations will add 1000 cycles per annotation (roughly eqv. to a sprintf and a strcpy for each annotation.)  Normally annotations are disabled and compile to a no-operation.  The <span class="T4">XRAY_ANNOTATE</span> preprocessor define will allow include annotations.</p><p class="Standard"> </p><p class="Standard"><span class="T1">Inline Functions:</span>  XRay profiles inline functions.  This also means that XRay overhead is added to  them.  This can be disastrous for tiny inlined functions such as get/set class member functions, which would normally reduce down to one or two cycles of move instructions.  With the XRay trace buffer overhead, these functions will balloon to 100 cycles.  An explicit <span class="T4">XRAY_NO_INSTRUMENT</span> modifier can be added to a function declaration to exclude it.  The time it consumes will still be accounted for in the profiled parent functions.  Future versions of nacl-gcc and nacl-g++ may include <span class="T2">-finstrument-functions-except-inlines</span>, which will avoid the requirement to explicitly mark inline functions <span class="T4">XRAY_NO_INSTRUMENT</span>.</p><p class="Standard"> </p><p class="Standard"><span class="T1">Limitations: </span> XRay has a few practical limitations.  At the moment, a trace entry requires 16 bytes.  8 bits are for representing call tree depth, so XRay won't report functions deeper than 255 calls.  24 bits are for representing the function address (which has to be aligned to at least 16 bytes), so XRay can only profile up to 28 bits of code address space – which should be enough for most Native Client applications.  64 bits of the trace buffer entry are used for recording RDTSC 'ticks' which is the RDTSC delta.  Even with 64 bits this counter can periodically overflow; XRay makes no attempt to detect and account for this situation.  Annotations are limited to 1024 characters per trace entry.  XRay cannot (currently) profile code with exceptions or setjmp/longjmp usage.  It requires x86 cpus to support the RDTSC instruction.  While XRay can correctly operate in a multi-threaded application, it can only profile a single thread at a time.  There are also some limitations with RDTSC – it reports cycles globally, so other threads can interrupt a profile and their time will show up in functions of the thread being profiled.  On multi-processor systems RDTSC might not be exactly synchronized across cores.  In practice, limitations with RDTSC can be worked around with processor affinity and repeated captures.  Slow code usually overshadows these limitations by orders of magnitude.</p><p class="Standard"> </p><p class="Standard">Deep recursion distortion:  currently XRay makes no attempt to make any guesstimates as to the amount of overhead it imposes on profiled code.  In practice it might be around 100 cycles per function.  This isn't bad when profiling near the root of the call tree, but can start to distort timings in deeper call trees.  This is especially evident in deep recursion of cheap functions, where the profile overhead is compounded, adding significant distortion to the execution time.  If the code is tail recursive, consider re-writing it to eliminate the recursion – your trace outputs will be more concise and easier to analyze.  Other types of recursion may be more difficult to eliminate.  Limiting the stack depth can also improve the timing of deep recursion, but does not eliminate the overhead altogether – XRay still has to track and check the depth value.  Future versions of XRay may attempt to recursively subtract its own overhead, but this is still only a guess, as modern out-of-order execution processors make predicting the actual overhead almost impossible – it will be unique to the surrounding code and how well it slots in with the instructions added by XRay.  XRay could self-profile a best minimal case at initialization and subtract that across the board during report generation, but if we are too aggressive, some functions will have negative cycle counts.</p><p class="Standard"> </p><p class="Standard"><span class="T1">Pruning The Trace: </span> There are currently three ways to prune a large trace capture.  First, most non-recursive code won't have very deep call trees – maybe 16 deep at the most.  But when recursion does enter the picture, the call tree depth can surge up to 255 entries deep.  Most of the time this deep recursion just pollutes the trace with an unnecessary amount of information.  So one option in XRay is to stop capture when a call tree exceeds a user specified depth.  In most cases, it is sufficient to do a capture run with the limit set to 255, identify where recursion occurs, and then set the depth limiter to a reasonable amount depending on when the recursion tends to occur.  A reasonable guess for most code would be a call depth between 16 and 32 entries.  Again, the time spent in deep call tress is still reported in the profiled parent functions.  This is a pruning option that occurs at <span class="T5">trace capture time.</span></p><p class="Standard"> </p><p class="Standard">The second option for pruning the trace output occurs when traces are dumped to a file.  XRay will automatically normalize the amount of time spent in a function as a percent of the total frame.  Most of the time, functions that only require a tiny percent of the total frame are... uninteresting – there are bigger fish to be frying.  So it is possible to instruct XRay to explicitly omit all functions (and their children) from the trace output if their percent of the captured frame is less than n percent, where n is specified by the user.  Again, the time spent in these omitted functions is still reflected (and reported) in the parent functions.  If a function is taking a suspicious amount of time, this filter can be temporarily lifted to reveal who is taking this time.  This is a pruning option that occurs at <span class="T5">report time.</span></p><p class="Standard"> </p><p class="Standard">The third option has already been alluded to earlier – any function can be removed from the trace by adding <span class="T4">XRAY_NO_INSTRUMENT</span> to its declaration.  This will omit profiling for that function, but still include profiling for any children (assuming they aren't filtered themselves)  It is recommended that  short inline 'leaf' functions be marked as <span class="T4">XRAY_NO_INSTRUMENT.</span>  (Usually you don't want to profile an inline member get function that boils down to one or two cycles of move instructions.  A couple of  exceptions to this is when you need the function to appear in the trace report but don't care about the cycle penalty imposed by profiling, or the parent function is showing unexpectedly slow cycle times and you suspect these small inline functions might actually have poor cache performance.)  This is a pruning option that occurs at <span class="T5">compile time.</span></p><p class="Standard"> </p><p class="Standard">By mixing the above three pruning methods -- compile time, capture time, and report time -- large bodies of code can be reduced down to manageable trace outputs. </p><p class="Standard"> </p><p class="Standard"><span class="T1">Adding User Annotations:</span>  XRay features an optional, but useful, API for explicitly annotating the call tree.  <span class="T4">XRayAnnotate(const char *str, ...)</span> is identical to <span class="T4">printf</span> in the arguments it takes, but instead of displaying the string on stdout, the string is copied into the trace buffer as part of the capture.  This annotation is attached to the function in which the call to XRayAnnotate appeared in.  Later, if a trace report is generated, the captured string is output adjacent to the function in the report.</p><p class="Standard"> </p><p class="Standard">For example: </p><p class="Standard"> </p><p class="P6">  int foo(int x) {</p><p class="P6">    int r;</p><p class="P6">    XRayAnnotate(“Value of x: %d”, x);</p><p class="P6">    r = x * x * x;</p><p class="P6">    XRayAnnotate(“  r: %d”, r);</p><p class="P6">    return r;</p><p class="P6">  }</p><p class="P6"> </p><p class="P6">  void bar() {</p><p class="P6">    foo(3);</p><p class="P6">    foo(2);</p><p class="P6">    foo(1);</p><p class="P6">  }</p><p class="P6"> </p><p class="Standard">In the trace report generated by XRay, the annotation will reveal the value of x and r for every call to function foo(). </p><p class="Standard"> </p><p class="Standard"> </p><p class="Standard"> </p><p class="Standard">(continued) </p><p class="Standard"> </p><p class="P6">Address         Ticks   Percent   Function &lt;optional annotation&gt;</p><p class="P6">---------------------------------------------------------------- </p><p class="P6">... </p><p class="P6">0x00010700       3300       0.9   bar()</p><p class="P6">0x00010310       1002       0.3    foo()  <span class="T6">Value of x: 3  r: 27</span></p><p class="P6">0x00010310       1004       0.3    foo()  <span class="T6">Value of x: 2  r: 8</span></p><p class="P6">0x00010310       1001       0.3    foo()  <span class="T6">Value of x: 1  r: 1</span></p><p class="P6">... </p><p class="Standard"> </p><p class="Standard"><span class="T1">Always On Profiling:</span>  XRay has reasonably minimal performance overhead.  In most cases, it can be left in an “always capture” mode.  The trace buffer is a large ring buffer, and constantly overwrites itself.  It is important to select the appropriately sized trace buffer for your application -- too small a trace buffer will result in incomplete profiles.  Too large a trace buffer, and you will find yourself having to surf too much profile data.  With a correctly selected trace buffer size at initialization, many frames of a manageable amount of profile information will always be available in the ring buffer.</p><p class="Standard"> </p><p class="Standard">An important feature of XRay is taking advantage of <span class="T5">always on</span> profiling.  Some profilers are dormant until the trigger occurs, at which point they perform a capture, but at much reduced execution speed of the application.  These profilers make it very hard to capture rarely occurring hiccups in performance – because you have to guess ahead of time when to trigger the capture!  XRay can be configured to <span class="T5">always</span> capture and if a hiccup occurs, a temporal trigger can dump the profile results <span class="T5">after-the-fact</span>, making it very easy to capture rare once-in-a-blue-moon performance stutters &amp; heartbeats. </p><p class="Standard"> </p><p class="Standard"><span class="T1">Analyzing Trace Ouput:  </span>First, it is recommended to capture a few trace reports using aggressive filtering.  Set the report filtering at two percent or higher.  Set the call tree depth to 4 calls.  Apply <span class="T4">XRAY_NO_INSTRUMENT</span> to the top few inline functions that are otherwise polluting the trace report.  This should yield a trace report that is only a few pages, even for huge bodies of code.  Spend some time looking over this high level “birds-eye” view.  Some surprises might be quickly evident.</p><p class="Standard"> </p><p class="Standard">Next, burrow down further into the top level parent functions that seem to be taking longer than you expect.  Take a moment to look at the text format of the trace capture output.  Consider writing grep, awk, perl or python scripts to help you analyze larger trace captures.</p><p class="Standard"> </p><p class="Standard"><span class="T1">Observations: </span> A large software project usually contains a mix of engineers with different skill sets and performance backgrounds.  Usually, surprises in performance come from unexpected areas.  The usual suspects:  garbage collection spikes, poorly written code in scripting languages, unexpected use of algorithms (especially overloaded [] in C++, where one would assume O(1) lookup, but the underlying implementation uses a linear... or worse... lookup on the index), well intentioned data structures that have poor cache performance, code that uses strings instead of enumerations, data structures that use too many cache in-coherent pointers instead of cache coherent arrays, code that unexpectedly breaks the “ways” of the cache, pathological branch prediction, numerical code that looks trivial but ends up exercising software floating point emulation or rounding modes, use of atof() when parsing a large number of floats, etc...</p><p class="Standard"> </p><p class="Standard"><span class="T1">Portability:</span>  All parts of XRay, at the moment, are written in Native Client.  It requires use of the RDTSC instruction, and file I/O to dump the trace capture, which can be enabled by using the “-d” debug flag on sel_ldr when launching NaCl applications.  It can be used equally well on Windows, Mac and Linux.</p><p class="Standard"> </p><p class="P8">Conclusion:<span class="T7">  XRay in its current state is about 20 pages of C code.  It is intended to be a minimal -- but highly useful and easy to understand -- hierarchical profiler.  The annotation ability also provides some limited debugging capability.</span></p><p class="P9"> </p><p class="P8">Future work: <span class="T7"> support for more reporting options, binary dump of trace capture for deeper off-line report / analysis, real-time graphing via shared memory ring buffer and concurrent profiling graph utility.  Runtime self-modifying code to disable/enable </span><span class="T2">__cyg_profile_func_enter()</span><span class="T7"> / </span><span class="T2">__cyg_profile_func_exit()</span><span class="T7"> function call overhead when not profiling, ability to profile multiple threads concurrently.</span></p><p class="P9"> </p><p class="P9"> </p><p class="P9"> </p><p class="P9"><span class="T1">Example Trace Report on Quake</span>, using heavily pruned output.  Functions requiring less than 0.5% of the frame are pruned.  The frame table of contents and the first two frames of rendering are shown here.  Following this example trace report is a short walk through demonstrating how to add XRay to a code base.   The report is in two parts, the first part is an overview of all the frames.  You can then use the label field to find the appropriate frame further down in the second half of the profile report. </p><p class="P9"> </p><p class="P9"> </p><p class="P10">Frame#      Total Ticks      Capture size    Annotations   Label</p><p class="P10">---------------------------------------------------------------------- </p><p class="P10">     0         38578023             18471              1   @@@frame0</p><p class="P10">     1         38609433             18403              1   @@@frame1</p><p class="P10">     2         38278584             18946              1   @@@frame2</p><p class="P10">     3         37807992             18817              1   @@@frame3</p><p class="P10">     4         38440449             19085              1   @@@frame4</p><p class="P10">     5         39003597             19055              1   @@@frame5</p><p class="P10">     6         41945193             19437              1   @@@frame6</p><p class="P10">     7         39021318             19132              1   @@@frame7</p><p class="P10">     8         38953836             19288              1   @@@frame8</p><p class="P10">     9         40055157             19583              1   @@@frame9</p><p class="P10">    10         41722308             21473              1   @@@frame10</p><p class="P10">    11         40208364             21394              1   @@@frame11</p><p class="P10">    12         40096296             21462              1   @@@frame12</p><p class="P10">    13         40166820             21880              1   @@@frame13</p><p class="P10">    14         39802320             22132              1   @@@frame14</p><p class="P10">    15         44599851             23978              1   @@@frame15</p><p class="P10">    16         42858477             23524              1   @@@frame16</p><p class="P10">    17         42737040             22140              1   @@@frame17</p><p class="P10">    18         41868396             21986              1   @@@frame18</p><p class="P10">    19         42065118             22375              1   @@@frame19</p><p class="P10">    20         41432913             21610              1   @@@frame20</p><p class="P10">    21         41432085             21366              1   @@@frame21</p><p class="P10">    22         41422329             21438              1   @@@frame22</p><p class="P10">    23         42658218             22965              1   @@@frame23</p><p class="P10">    24         42124986             23416              1   @@@frame24</p><p class="P10">    25         40656915             23409              1   @@@frame25</p><p class="P10">    26         40898682             24175              1   @@@frame26</p><p class="P10">    27         40306761             22569              1   @@@frame27</p><p class="P10">    28         39704139             22061              1   @@@frame28</p><p class="P10">    29         40285233             23210              1   @@@frame29</p><p class="P10">    30         40479561             22830              1   @@@frame30</p><p class="P10">    31         39923307             22013              1   @@@frame31</p><p class="P10">    32         40395078             22168              1   @@@frame32</p><p class="P10">    33         40016079             22648              1   @@@frame33</p><p class="P10">    34         40001832             22770              1   @@@frame34</p><p class="P10">    35         39875715             22758              1   @@@frame35</p><p class="P10">    36         40033116             22952              1   @@@frame36</p><p class="P10">    37         40063905             23049              1   @@@frame37</p><p class="P10">    38         40234455             23068              1   @@@frame38</p><p class="P10">    39         40101408             23211              1   @@@frame39</p><p class="P10">    40         40385430             24121              1   @@@frame40</p><p class="P10">    41         49233465             27678              1   @@@frame41</p><p class="P10">    42         48870225             27874              1   @@@frame42</p><p class="P10">    43         49595139             27578              1   @@@frame43</p><p class="P10">    44         49410378             28399              1   @@@frame44</p><p class="P10">    45         48723327             28485              1   @@@frame45</p><p class="P10">    46         48720195             28765              1   @@@frame46</p><p class="P10">    47         48202893             28018              1   @@@frame47</p><p class="P10">    48         48704220             27988              1   @@@frame48</p><p class="P10">    49         48594735             27848              1   @@@frame49</p><p class="P10">    50         48975957             27802              1   @@@frame50</p><p class="P10">    51         49999113             28864              1   @@@frame51</p><p class="P10">    52         50078736             30096              1   @@@frame52</p><p class="P10">    53         52864776             36456              1   @@@frame53</p><p class="P10">    54         54470799             38659              1   @@@frame54</p><p class="P10">    55         53051958             39805              1   @@@frame55</p><p class="P10">    56         53089164             41015              1   @@@frame56</p><p class="P10">    57         53421876             41180              1   @@@frame57</p><p class="P10">    58         53214642             41225              1   @@@frame58</p><p class="P10">---------------------------------------------------------------------- </p><p class="P10">XRay: 59 frame(s)    1468103 total capture(s)</p><p class="P10"> </p><p class="P10"> </p><p class="P10"> </p><p class="P10">====================================================================== </p><p class="P10">label @@@frame0 </p><p class="P10"> </p><p class="P10">   Address        Ticks   Percent      Function  &lt;optional annotation&gt;</p><p class="P10">---------------------------------------------------------------------- </p><p class="P10">0x0002FAA0     38565216     100.0      Host_Frame</p><p class="P10">0x0002F500     38564388     100.0       _Host_Frame</p><p class="P10">0x000759A0     38335896      99.4        SCR_UpdateScreen</p><p class="P10">0x0008CC80     23560650      61.1         V_RenderView</p><p class="P10">0x00067920     23533893      61.0          R_RenderView</p><p class="P10">0x00067100     23531589      61.0           R_RenderView_</p><p class="P10">0x000664C0     18393984      47.7            R_EdgeDrawing</p><p class="P10">0x000604C0       258417       0.7             R_RenderWorld</p><p class="P10">0x0005FEA0       257841       0.7              R_RecursiveWorldNode</p><p class="P10">0x0005FEA0       255933       0.7               R_RecursiveWorldNode</p><p class="P10">0x0005FEA0       254799       0.7                R_RecursiveWorldNode</p><p class="P10">0x0005FEA0       254088       0.7                 R_RecursiveWorldNode</p><p class="P10">0x0005FEA0       253143       0.7                  R_RecursiveWorldNode</p><p class="P10">0x0005FEA0       250380       0.6                   R_RecursiveWorldNode</p><p class="P10">0x0005FEA0       248418       0.6                    R_RecursiveWorldNode</p><p class="P10">0x000643C0     18058671      46.8             R_ScanEdges</p><p class="P10">0x00025800      9790209      25.4              D_DrawSurfaces</p><p class="P10">0x000290C0      1767573       4.6               D_DrawSpans8</p><p class="P10">0x00028EE0       343350       0.9               D_DrawZSpans</p><p class="P10">0x000290C0       640764       1.7               D_DrawSpans8</p><p class="P10">0x000290C0      3491316       9.1               D_DrawSpans8</p><p class="P10">0x00028EE0       616662       1.6               D_DrawZSpans</p><p class="P10">0x000290C0       329067       0.9               D_DrawSpans8</p><p class="P10">0x000290C0       213138       0.6               D_DrawSpans8</p><p class="P10">0x000290C0       331992       0.9               D_DrawSpans8</p><p class="P10">0x000290C0       991287       2.6               D_DrawSpans8</p><p class="P10">0x00025800      5613084      14.5              D_DrawSurfaces</p><p class="P10">0x000290C0      1132128       2.9               D_DrawSpans8</p><p class="P10">0x00028EE0       221553       0.6               D_DrawZSpans</p><p class="P10">0x000290C0       497043       1.3               D_DrawSpans8</p><p class="P10">0x000290C0       590841       1.5               D_DrawSpans8</p><p class="P10">0x000290C0      1223667       3.2               D_DrawSpans8</p><p class="P10">0x00028EE0       227484       0.6               D_DrawZSpans</p><p class="P10">0x000290C0       380529       1.0               D_DrawSpans8</p><p class="P10">0x000290C0       886743       2.3               D_DrawSpans8</p><p class="P10">0x00066BE0      4103073      10.6            R_DrawEntitiesOnList</p><p class="P10">0x0005FB20      1559475       4.0             R_AliasDrawModel</p><p class="P10">0x0005F800      1553427       4.0              R_AliasPreparePoints</p><p class="P10">0x0005FB20      2117061       5.5             R_AliasDrawModel</p><p class="P10">0x0005F800      2113443       5.5              R_AliasPreparePoints</p><p class="P10">0x0005FB20       256095       0.7             R_AliasDrawModel</p><p class="P10">0x0005EDC0       251388       0.7              R_AliasPrepareUnclippedPoints</p><p class="P10">0x00028E20       243297       0.6               D_PolysetDraw</p><p class="P10">0x00028B80       243072       0.6                D_DrawNonSubdiv</p><p class="P10">0x00066840       919269       2.4            R_DrawViewModel</p><p class="P10">0x0005FB20       907380       2.4             R_AliasDrawModel</p><p class="P10">0x0005F800       902331       2.3              R_AliasPreparePoints</p><p class="P10">0x00089CA0     14754924      38.2         VID_Update</p><p class="P10">0x00089A80     14752476      38.2          (null)  // static funciton</p><p class="P10"> </p><p class="P10">====================================================================== </p><p class="P10">label @@@frame1 </p><p class="P10"> </p><p class="P10">   Address        Ticks   Percent      Function  &lt;optional annotation&gt;</p><p class="P10">---------------------------------------------------------------------- </p><p class="P10">0x0002FAA0     38596356     100.0      Host_Frame</p><p class="P10">0x0002F500     38595402     100.0       _Host_Frame</p><p class="P10">0x000759A0     38373732      99.4        SCR_UpdateScreen</p><p class="P10">0x0008CC80     23582070      61.1         V_RenderView</p><p class="P10">0x00067920     23553576      61.0          R_RenderView</p><p class="P10">0x00067100     23551164      61.0           R_RenderView_</p><p class="P10">0x000664C0     18382464      47.6            R_EdgeDrawing</p><p class="P10">0x000604C0       264087       0.7             R_RenderWorld</p><p class="P10">0x0005FEA0       263484       0.7              R_RecursiveWorldNode</p><p class="P10">0x0005FEA0       261027       0.7               R_RecursiveWorldNode</p><p class="P10">0x0005FEA0       259641       0.7                R_RecursiveWorldNode</p><p class="P10">0x0005FEA0       258867       0.7                 R_RecursiveWorldNode</p><p class="P10">0x0005FEA0       258111       0.7                  R_RecursiveWorldNode</p><p class="P10">0x0005FEA0       255168       0.7                   R_RecursiveWorldNode</p><p class="P10">0x0005FEA0       253575       0.7                    R_RecursiveWorldNode</p><p class="P10">0x000643C0     18039600      46.7             R_ScanEdges</p><p class="P10">0x00025800      9837378      25.5              D_DrawSurfaces</p><p class="P10">0x000290C0      1650078       4.3               D_DrawSpans8</p><p class="P10">0x00028EE0       325638       0.8               D_DrawZSpans</p><p class="P10">0x000290C0       641736       1.7               D_DrawSpans8</p><p class="P10">0x000290C0      3468933       9.0               D_DrawSpans8</p><p class="P10">0x00028EE0       613539       1.6               D_DrawZSpans</p><p class="P10">0x000290C0       349830       0.9               D_DrawSpans8</p><p class="P10">0x000290C0       265491       0.7               D_DrawSpans8</p><p class="P10">0x000290C0       333585       0.9               D_DrawSpans8</p><p class="P10">0x000290C0      1089666       2.8               D_DrawSpans8</p><p class="P10">0x00028EE0       212247       0.5               D_DrawZSpans</p><p class="P10">0x00025800      5571315      14.4              D_DrawSurfaces</p><p class="P10">0x000290C0      1059129       2.7               D_DrawSpans8</p><p class="P10">0x00028EE0       215793       0.6               D_DrawZSpans</p><p class="P10">0x000290C0       510471       1.3               D_DrawSpans8</p><p class="P10">0x000290C0       571500       1.5               D_DrawSpans8</p><p class="P10">0x000290C0      1177155       3.0               D_DrawSpans8</p><p class="P10">0x00028EE0       213192       0.6               D_DrawZSpans</p><p class="P10">0x000290C0       433746       1.1               D_DrawSpans8</p><p class="P10">0x000290C0       920241       2.4               D_DrawSpans8</p><p class="P10">0x00066BE0      4149441      10.7            R_DrawEntitiesOnList</p><p class="P10">0x0005FB20      1561581       4.0             R_AliasDrawModel</p><p class="P10">0x0005F800      1555011       4.0              R_AliasPreparePoints</p><p class="P10">0x0005FB20      2168361       5.6             R_AliasDrawModel</p><p class="P10">0x0005F800      2164608       5.6              R_AliasPreparePoints</p><p class="P10">0x0005FB20       252918       0.7             R_AliasDrawModel</p><p class="P10">0x0005EDC0       248139       0.6              R_AliasPrepareUnclippedPoints</p><p class="P10">0x00028E20       240201       0.6               D_PolysetDraw</p><p class="P10">0x00028B80       239967       0.6                D_DrawNonSubdiv</p><p class="P10">0x00066840       897588       2.3            R_DrawViewModel</p><p class="P10">0x0005FB20       886122       2.3             R_AliasDrawModel</p><p class="P10">0x0005F800       881136       2.3              R_AliasPreparePoints</p><p class="P10">0x00089CA0     14769072      38.3         VID_Update</p><p class="P10">0x00089A80     14767911      38.2          (null)  // static function</p><p class="P10"> </p><p class="P10">====================================================================== </p><p class="P10"> </p><p class="P10"> </p><p class="P10"> </p><p class="P10"> </p><p class="P10"> </p><p class="P11"><span class="T1">Example usage</span>:  This section will explain the required steps to add XRay to an existing code base.  All source files that need to make explicit calls to XRay functions (including XRayAnnotate) will need to #include “xray.h”</p><p class="P11"> </p><p class="P11">  Step a)  First, in your startup code, make a call to <span class="T4">XRayInit()</span>  Using Quake as an example, a good place the call <span class="T4">XRayInit()</span> is in sys_nacl.c, in function <span class="T4">main()</span>, right before it enters the application's main loop.</p><p class="P11"> </p><p class="P12">    // Step a</p><p class="P12">    // Initialize XRay and load symbols from mapfile</p><p class="P12">    XRayInit(4, 4000000, 100, “naclquake.map”)</p><p class="P12"> </p><p class="P11">  The first argument is the maximum stack depth XRay should record.  XRay can record stack depths of up to 255 entries.  By selecting smaller values, it is possible to prune call trees, especially if they contain deep recursion.</p><p class="P11">  The second argument is the number of entries you want to use for the trace buffer.  Each entry requires 16 bytes.  Small projects can probably get away with 100,000 entries.  Large projects can use much bigger buffers, as much as Native Client will allow.  Quake, for example, can use a buffer size of 4,000,000 entries.  Although most modern machines come equipped with plenty of physical memory, take care not to exceed it or profile results will be meaningless due to excessive paging.</p><p class="P11">  The third argument is the frame count.  This is an upper ceiling on how many frames you want to capture.  In Quake, a frame is a single video frame of animation.  Quake runs about 60 frames per second.  A frame doesn't have to be tied to video, it could be more abstract, like a database transaction.  In XRay, <span class="T4">XRayStartFrame()</span> / <span class="T4">XRayEndFrame()</span> pair define a frame.</p><p class="P11">  The last argument is the name of the mapfile generated by nacl-ld obtained by using the linker option <span class="T4">-Map myapp.map </span>Or, if your project uses nacl-gcc / nacl-g++ to link, use <span class="T4">-Wl,-Map,myapp.map  </span>If you find c++ name mangling to be inconvenient, use the GNU utility <span class="T4">c++filt</span> to demangle all the symbols in the mapfile.  See the <span class="T4">xray</span> build target options in the makefile native_client/common/Makefile.mk</p><p class="P11"> </p><p class="P11">  Step b)  Wrap the high level function you want to profile.  In Quake this is the function Host_Frame(time) in sys_nacl.c</p><p class="P11"> </p><p class="P12">    // Step b</p><p class="P12">    XRayStartFrame();</p><p class="P12">    Host_Frame(time);  // profile Quake main loop</p><p class="P12">    XRayEndFrame();</p><p class="P12"> </p><p class="P12">    // Step c</p><p class="P12">    // elsewhere in code, '-' keypress sets g_profile to true</p><p class="P12">    if (g_profile) {</p><p class="P12">      XRaySaveReport(“quake.profile”, 0.5f);</p><p class="P12">      g_profile = false;</p><p class="P12">    }</p><p class="P12"> </p><p class="P11">  Step c)  Figure out how you want to be able to trigger dumping a profile report.  One method is to bind an unused key.  When this key is pressed, set a global variable indicating that the user has requested a profile dump.  This needs to be deferred, because dumping the profile cannot occur from inside an <span class="T4">XRayStartFrame()</span> / <span class="T4">XRayEndFrame()</span>.  For example, profiles could be triggered when the user presses the minus (-) key on the numeric keypad if that key is normally unused by the application.  You will need to find where the application responds to keyboard events and add code indicating that the user has requested a profile dump.  If this event loop is called while the profiler is active (ie the code is between <span class="T4">XRayStartFrame()</span> and <span class="T4">XRayEndFrame()</span>) then you will need to set a global variable to indicate that after the frame completes, a profile report should be saved.  Do not try to save profile reports between <span class="T4">XRayStartFrame()</span> and <span class="T4">XRayEndFrame()</span>.</p><p class="P11"> </p><p class="P11">In code outside the <span class="T4">XrayStartFrame()</span> / <span class="T4">XrayEndFrame() </span>pair, if the user has requested a profile report, call <span class="T4">XRaySaveReport(const char *filename, float percent_cutoff)</span>  This will open the named file and write out a text based profile report, with some pruning depending on how aggressive the percent_cutoff value is.  As a starting point, it is reasonable to prune all functions that consume less than 1% of a frame – these small functions (most likely leaves on the calltree) usually aren't where big optimization gains will be made when profiling large, complex bodies of code.</p><p class="P11"> </p><p class="P12">    // Step d</p><p class="P12">    XRayShutdown();</p><p class="P12"> </p><p class="P11">  Step d)  Shutdown XRay in your application shutdown to release resources.</p><p class="P11"> </p><p class="P11">You will probably need to tweak a few of the options that prune the profile to get useful output.  Start by aggressively pruning to get a good bird's eye view of where time is being spent in the top level functions.  If an area of the application takes longer than you expect, especially as a percent of the other operations, relax the pruning constraints and take more profiles.  It shouldn't take long to drill down and figure out why code takes an unexpected amount of time.</p><p class="P11"> </p><p class="P11">Once you are familiar with this profiler, build-run-profile-edit sessions should go very quickly.  You can also use grep, awk, perl, or python on the generated reports, or modify XRay itself to extend its capabilities.</p><p class="P11"> </p><p class="P11"> </p><p class="P3"> </p><p class="P2"> </p><p class="P2">Appendix A - XRay Reference </p><p class="P11"> </p><p class="P11"> </p><p class="P13">void XRayInit(int stack_size, </p><p class="P13">              int buffer_size,</p><p class="P13">              int frame_count,</p><p class="P13">              const char *filename)</p><p class="P13"> </p><p class="P12">  inputs:</p><p class="P12"> </p><p class="P12">    stack_size   maximum call tree depth.  The profiler will not</p><p class="P12">                 add functions to the trace buffer whose depth in</p><p class="P12">                 the call stack exceed this value.</p><p class="P12"> </p><p class="P12">    buffer_size  size of the trace buffer, in units of trace entries.</p><p class="P12">                 A trace entry is 16 bytes, and holds either a single</p><p class="P12">                 function call, or a string fragment from an</p><p class="P12">                 annotation.  Small programs can use 100,000 entries.</p><p class="P12">                 Large programs can use 4,000,000 or more.</p><p class="P12"> </p><p class="P12">    frame_count  number of frames to capture before wrapping around.</p><p class="P12">                 This is the maximum number of frames to record at</p><p class="P12">                 any given time.  The actual number recorded can be</p><p class="P12">                 less (the user only took 5 frames of capture, or the</p><p class="P12">                 buffer_size isn't large enough to accommodate all</p><p class="P12">                 frames.)  If you only need a single profile, use</p><p class="P12">                 a value of 1.</p><p class="P12"> </p><p class="P12">    filename     filename of nacl-gcc generated mapfile.  Without</p><p class="P12">                 this file, the trace report won't have function</p><p class="P12">                 names.</p><p class="P12"> </p><p class="P12"> </p><p class="P13"> </p><p class="P13">void XRayShutdown() </p><p class="P12"> </p><p class="P12">  action:        shuts down and releases memory used by XRay.</p><p class="P12"> </p><p class="P12"> </p><p class="P12"> </p><p class="P13">void XRayStartFrame() </p><p class="P12"> </p><p class="P12">  action:        starts recording of next frame into trace buffer.</p><p class="P12">                 Only the thread that calls XRayStartFrame() will be</p><p class="P12">                 recorded.  This call is not re-entrant and cannot be</p><p class="P12">                 nested.</p><p class="P12"> </p><p class="P12"> </p><p class="P12"> </p><p class="P13">void XRayEndFrame() </p><p class="P13"> </p><p class="P12">  action:        stops recording into trace buffer.  Only the thread</p><p class="P12">                 that originally called XRayStartFrame() should call</p><p class="P12">                 XRayEndFrame().  This function must be called to</p><p class="P12">                 close the frame before opening the next frame or</p><p class="P12">                 saving a profile report.</p><p class="P12"> </p><p class="P12"> </p><p class="P12"> </p><p class="P13">void XRaySetAnnotationFilter(uint32_t filter) </p><p class="P13"> </p><p class="P12">  input:</p><p class="P12"> </p><p class="P12">    filter       Sets the annotation filter bit mask.  By default,</p><p class="P12">                 all annotation channels are enabled.  This is a </p><p class="P12">                 mechanism that can be used to filter annotations in</p><p class="P12">                 the trace output.  The filter can be different</p><p class="P12">                 developers, different areas of the code, or any</p><p class="P12">                 arbitrary filter – it is entirely up to the user.</p><p class="P12"> </p><p class="P12"> </p><p class="P12"> </p><p class="P13">void XRayAnnotate(const char *str, ...) </p><p class="P12"> </p><p class="P12">  input:</p><p class="P12"> </p><p class="P12">    str...       A printf style string with optional arguments.  The</p><p class="P12">                 output string is appended to the trace buffer.  In</p><p class="P12">                 the trace report, the string appears on the same</p><p class="P12">                 line the function who called it.  This is an easy</p><p class="P12">                 way to add simple debug information to the trace</p><p class="P12">                 report.</p><p class="P12"> </p><p class="P12">                 Keep in mind that each call consumes performance,</p><p class="P12">                 about 1000 RDTSC ticks.  If XRAY_ANNOTATE isn't</p><p class="P12">                 defined, XRayAnnotate() calls are reduced to null</p><p class="P12">                 code.</p><p class="P12"> </p><p class="P12"> </p><p class="P12"> </p><p class="P13">void XRayAnnotateFiltered(int mask, const char *str, ...) </p><p class="P12"> </p><p class="P12">  input:</p><p class="P12"> </p><p class="P12">    mask         A mask value that is tested against the current </p><p class="P12">                 filter set by XraySetAnnotationFilter.  If the mask</p><p class="P12">                 test fails, the annotation string isn't appended to</p><p class="P12">                 the trace and the execution time of this function is</p><p class="P12">                 significantly less.</p><p class="P12"> </p><p class="P12">    str...       printf style argument(s), same as <span class="T1">XRayAnnotate()</span></p><p class="P12"> </p><p class="P12"> </p><p class="P12"> </p><p class="P13">void XRaySaveReport(const char *filename, float cutoff) </p><p class="P12"> </p><p class="P12">  input:</p><p class="P12"> </p><p class="P12">    filename     output filename to save text profile report.</p><p class="P12"> </p><p class="P12">    cutoff       functions below this cutoff, expressed as a percent,</p><p class="P12">                 will not be output to the trace report.  This is a</p><p class="P12">                 handy way to filter out small, insignificant</p><p class="P12">                 functions.  Usually filtering out functions that</p><p class="P12">                 take less than 1% of a frame will produce a nicely </p><p class="P12">                 pruned report for fast bird's eye view analysis.  </p><p class="P12">                 When you want to generate the entire call tree</p><p class="P12">                 report, set this value to 0.0</p><p class="P12"> </p><p class="P12">  action:        Saves the trace report in text form to the specified</p><p class="P12">                 file.  This function must be called outside</p><p class="P12">                 XRayStartFrame() / XRayEndFrame()</p><p class="P12"> </p><p class="P12"> </p><p class="P11"> </p></body></html>